Text Classification

The purpose of the text classification is to create a classifier that can classify a tweet as 'pro-protester' or 'pro-police'. These classifications will be used to calculate the percent of 'pro-protester' tweets for specific users. That percentage will then be used as a feature in a Random Forest classifier that will attempt to predict whether a user will remain engaged in the Ferguson conversation on Twitter.

Resource and code help: Python 3 Text Processing with NLTK 3 Cookbook, Jacob Perkins


Import libraries and connect to the database


In [3]:
#imports and query mongodb

import json
import pymongo 
from bson import json_util # bundled with pymongo
import numpy as np
import pandas as pd
from datetime import datetime
from nltk.tokenize import word_tokenize

mdb = pymongo.MongoClient('mongodb://10.208.160.157')
db = mdb.ferguson
tweets = db.tweets
aug_tweets = db.tweets_aug

Obtain and prepare training data

Specific users were chosen to train the classifier, based on the content of their tweets. To be chosen for the training set, the majority of a user's Ferguson-related content needed to be clearly 'pro-protester' or 'pro-police', and the user's tweets could not consist primarily of retweets.


In [4]:
#queries the database for the user's tweets and creates a dataframe for the user

def getTweetsAndLabel(name, label):
    tweet_fields = ['id', '_iso_created_at', 'text' ]
    user = tweets.find({"user.screen_name": name}, fields = tweet_fields).sort([("_iso_created_at", -1 )])
    df = pd.DataFrame(list(user), columns = tweet_fields)
    df['label'] = label
    return df
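
Note that the fields keyword argument belongs to the pymongo 2.x API. On pymongo 3.x the equivalent is a projection passed as the second argument to find; a minimal sketch (the function name is ours, everything else mirrors the query above):

#pymongo 3.x equivalent of the query above (sketch)
def getTweetsAndLabel_v3(name, label):
    tweet_fields = ['id', '_iso_created_at', 'text']
    projection = {f: 1 for f in tweet_fields}
    user = tweets.find({"user.screen_name": name}, projection).sort("_iso_created_at", -1)
    df = pd.DataFrame(list(user), columns = tweet_fields)
    df['label'] = label
    return df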

A dataframe is generated with the tweets of each user, and each tweet is given a label based on whether that user is 'pro-protester' or 'pro-police'.

Dataframes for pro-protester users: 'p'


In [5]:
deray_df = getTweetsAndLabel('deray', 'p')

In [6]:
big6domino_df = getTweetsAndLabel('Big6domino', 'p')

In [7]:
wesleylowery_df = getTweetsAndLabel('WesleyLowery', 'p')

In [8]:
plussone_df = getTweetsAndLabel('plussone', 'p')

In [9]:
jsanbower_df = getTweetsAndLabel('JSanbower', 'p')

In [10]:
est_laced_up_df = getTweetsAndLabel('EST_Laced_Up', 'p')

In [11]:
lisaarmstrong_df = getTweetsAndLabel('LisaArmstrong', 'p')

In [12]:
judsontwit_df = getTweetsAndLabel('judsontwit', 'p')

In [13]:
#combine all pro-protester user dataframes
prot = [deray_df, big6domino_df, wesleylowery_df, plussone_df, est_laced_up_df, judsontwit_df, jsanbower_df, lisaarmstrong_df]
prot_df = pd.concat(prot)
len(prot_df)


Out[13]:
7215

Dataframes for pro-police users: 'c'


In [14]:
hotnostrilsrfun_df = getTweetsAndLabel('HotNostrilsrFun', 'c')

In [15]:
rburg63_df = getTweetsAndLabel('rburg63', 'c')

In [16]:
msully65_df = getTweetsAndLabel('msully65', 'c')

In [17]:
anna12061_df = getTweetsAndLabel('anna12061', 'c')

In [18]:
scrufey21_df = getTweetsAndLabel('Scrufey21', 'c')

In [19]:
jake_bradford88_df = getTweetsAndLabel('jake_bradford88', 'c')

In [20]:
johncardillo_df = getTweetsAndLabel('johncardillo', 'c')

In [21]:
#combine all pro-police dataframes
pol = [hotnostrilsrfun_df, rburg63_df, msully65_df, anna12061_df, scrufey21_df, jake_bradford88_df, johncardillo_df]
pol_df = pd.concat(pol)
len(pol_df)


Out[21]:
5884

In [22]:
#combine pro-protester and pro-police dataframes
df_all = pd.concat([prot_df, pol_df]).reset_index(drop = True)
len(df_all)


Out[22]:
13099

Check the balance of the data set.


In [28]:
print "Pro-protestor tweets make up {0:.0f}% of the labeled data set.".format((len(prot_df)*1.0 / len(df_all))*100)
print "Pro-police tweets make up {0:.0f}% of the labeled data set.".format((len(pol_df)*1.0 / len(df_all))*100)


Pro-protester tweets make up 55% of the labeled data set.
Pro-police tweets make up 45% of the labeled data set.

Prepare the text

The text was prepared for training using the following steps:

  • tokenize tweets into words
  • convert words to lowercase
  • remove standard NLTK stop words, as these common words are not useful for classification
  • remove custom stop words, such as 'http', which are likewise not useful for classification
  • remove words of two characters or fewer
  • remove numbers and other non-alphabetic tokens
  • include the 10 highest-scoring bigrams per tweet, determined using the chi-squared association measure

Note that many of these steps could be optimized further. For instance, more or fewer bigrams could be used, more stop words could be added, and the analysis could be restricted to a fixed number of the most informative words. However, the accuracy of the classifier was high enough that further optimization of the text processing was not necessary.


In [30]:
#prepare text using steps described above
def prepText(df):
    #get the NLTK english stop words
    from nltk.corpus import stopwords
    english_stops = set(stopwords.words('english'))
    from nltk.collocations import BigramCollocationFinder
    from nltk.metrics import BigramAssocMeasures

    for i in range (0, len(df)):
        
        words = word_tokenize(df.loc[i, 'text'])
        words = [w.lower() for w in words]
        
        custom_stops = ['http', 'https', '...', 'amp']
        
        words = [word for word in words  if word not in english_stops and len(word) > 2 and word not in custom_stops and word.isalpha()]
        
        bigram_finder = BigramCollocationFinder.from_words(words)
        bigrams = bigram_finder.nbest(BigramAssocMeasures.chi_sq, 10)
    
        words = (words + bigrams)
        
        #put the word list into the correct format for the classifier:
        #a dict of {feature: True}, stored as its string form in the dataframe
        words_dict = {word: True for word in words}
        df.loc[i, 'train data'] = str(words_dict)
    return df
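
To illustrate the format prepText produces, here is a minimal sketch run on a single made-up tweet (bigram features omitted for brevity; assumes the NLTK stopwords and punkt data are installed):

#sketch: the feature-dict format for one hypothetical tweet
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

english_stops = set(stopwords.words('english'))
custom_stops = ['http', 'https', '...', 'amp']

sample = "Protesters are marching peacefully in Ferguson tonight"
words = [w.lower() for w in word_tokenize(sample)]
words = [w for w in words if w not in english_stops and len(w) > 2
         and w not in custom_stops and w.isalpha()]
print {word: True for word in words}
#-> e.g. {'protesters': True, 'marching': True, 'peacefully': True,
#         'ferguson': True, 'tonight': True}   (dict order may vary)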

Use parallel computing for text preparation

IPython parallel computing was used to speed up the text preparation.


In [38]:
#set up parallel computing and confirm number of engines

from IPython import parallel

#make sure to enter the correct profile
clients = parallel.Client(profile='nbserver')

# use synchronous computations - all results must finish computing before any results are recorded
clients.block = True  
dview = clients.direct_view()
print clients.ids


[0, 1, 2, 3, 4, 5]

In [39]:
%px from nltk.tokenize import word_tokenize

dview.push(dict(prepText = prepText))

dview.scatter('df_all', df_all)
dview.execute('df_all.reset_index(drop = True, inplace = True)')
dview.execute('df_all = prepText(df_all)')
all_list = dview.gather('df_all')
df_all = pd.concat(all_list).reset_index(drop = True)
print len(df_all)
df_all.head()


13099
Out[39]:
id _iso_created_at text label train data
0 540731110779416576 2014-12-05 04:55:17 RT @PWeiskel08: Pepper balls in Boston. Lots o... p {u'pepper': True, u'balls': True, u'people': T...
1 540730147423272960 2014-12-05 04:51:28 We do not accept Chief Belmar's "apology" re: ... p {u'belmar': True, u'tamir': True, u'murder': T...
2 540729682274959360 2014-12-05 04:49:37 RT @PDPJ: #Ferguson protesters burn flag, Nati... p {u'remnants': True, u'national': True, u'walk'...
3 540727520530661376 2014-12-05 04:41:01 Why protest? Because we refuse to be scared in... p {u'refuse': True, (u'refuse', u'drown'): True,...
4 540726405986652160 2014-12-05 04:36:36 Issue #64 of the #Ferguson Protestor Newslette... p {(u'share', u'stay'): True, (u'protestor', u'n...

Generate training and test data sets

The data is split into random training and test sets using sklearn.cross_validation.train_test_split, with a test size of 0.33.


In [40]:
#prepare train and test sets
import ast
import collections
from nltk import metrics
def prepTrainText(df):
    from sklearn.cross_validation import train_test_split

    #convert the stringified feature dicts back into (feature dict, label) pairs
    data_list = []
    for i in range(0, len(df)):
        data_list.append((ast.literal_eval(df.loc[i]['train data']), df.loc[i]['label']))
    train_data, test_data = train_test_split(data_list, test_size = 0.33, random_state = 42)
    return train_data, test_data
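
Note: sklearn.cross_validation was the module path at the time this was written; in scikit-learn 0.18 and later, train_test_split lives in sklearn.model_selection. A sketch of the modern import, with the call otherwise unchanged:

#scikit-learn >= 0.18 (sketch; the call signature is the same)
from sklearn.model_selection import train_test_split
train_data, test_data = train_test_split(data_list, test_size = 0.33, random_state = 42)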

In [41]:
train_set, test_set = prepTrainText(df_all)

In [42]:
print "The length of of the training set is %d." % len(train_set)
print "The length of the test set is %d." % len(test_set)


The length of the training set is 8776.
The length of the test set is 4323.

Train and test the classifier

Since this is a text classification problem, we chose to train an NLTK Naive Bayes classifier. Naive Bayes is suggested as a good starting point in the Python 3 Text Processing with NLTK 3 Cookbook.

In addition, the scikit-learn algorithm cheat sheet (http://scikit-learn.org/stable/tutorial/machine_learning_map/) shows Naive Bayes as a good choice for labeled text data.


In [33]:
#train classifier and show most informative features
from nltk.classify import NaiveBayesClassifier
classifier = NaiveBayesClassifier.train(train_set)
classifier.show_most_informative_features()


Most Informative Features
                 rioters = True                c : p      =     57.6 : 1.0
   (u'ferguson', u'cnn') = True                p : c      =     47.5 : 1.0
                    thug = True                c : p      =     39.6 : 1.0
(u'movement', u'ferguson') = True                p : c      =     36.2 : 1.0
            conservative = True                c : p      =     29.0 : 1.0
                  stolen = True                c : p      =     27.4 : 1.0
 (u'protest', u'leader') = True                c : p      =     25.7 : 1.0
              terrorists = True                c : p      =     23.3 : 1.0
                tchopstl = True                p : c      =     20.4 : 1.0
                    west = True                p : c      =     19.3 : 1.0
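
The right-hand column is a likelihood ratio: how much more likely a feature is to appear under one label than the other. A small sketch with made-up probabilities shows the arithmetic behind the 'rioters' line:

#the '57.6 : 1.0' for 'rioters' means P('rioters' | 'c') / P('rioters' | 'p') is about 57.6
#(the probabilities below are hypothetical, for illustration only)
p_given_c = 0.0288   #estimated probability that a pro-police tweet contains 'rioters'
p_given_p = 0.0005   #estimated probability that a pro-protester tweet contains 'rioters'
print "%.1f : 1.0" % (p_given_c / p_given_p)   #-> 57.6 : 1.0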

In [31]:
#function to calculate precision and recall
def precAndRec(classifier, test_feats):
    #refsets collects tweet indices by true label; testsets by predicted label
    refsets = collections.defaultdict(set)
    testsets = collections.defaultdict(set)

    precisions = {}
    recalls = {}
    for i, (feats, label) in enumerate(test_feats):
        refsets[label].add(i)
        observed = classifier.classify(feats)
        testsets[observed].add(i)
    
    for label in classifier.labels():
        precisions[label] = metrics.precision(refsets[label], testsets[label])
        recalls[label] = metrics.recall(refsets[label], testsets[label])
    
    return precisions, recalls

In [34]:
#test classifier and output results
import nltk.classify.util
print 'Accuracy: %0.2f' % nltk.classify.util.accuracy(classifier, test_set)

prec, rec = precAndRec(classifier, test_set)
print "Precision for 'p': %0.2f" % prec['p']
print "Precision for 'c': %0.2f" % prec['c']
print "Recall for 'p': %0.2f" % rec['p']
print "Recall for 'c': %0.2f" % rec['c']


Accuracy: 0.84
Precision for 'p': 0.88
Precision for 'c': 0.79
Recall for 'p': 0.82
Recall for 'c': 0.87

The accuracy, precision, and recall are all good, so no further optimization of the text processing will be done.

Scikit-learn also offers a Naive Bayes classifier, and NLTK provides a wrapper class for it. Scikit-learn's Naive Bayes classifier has a smaller memory footprint, and with the amount of data we're working with, memory considerations are important. So we also tried this classifier.


In [35]:
#sklearn Naive Bayes via NLTK's wrapper (smaller memory footprint than the pure-NLTK classifier)

from nltk.classify.scikitlearn import SklearnClassifier
from sklearn.naive_bayes import MultinomialNB
sknb_classifier = SklearnClassifier(MultinomialNB())
sknb_classifier.train(train_set)
print nltk.classify.util.accuracy(sknb_classifier, test_set)

prec, rec = precAndRec(sknb_classifier, test_set)
print "Precision for 'p': %0.2f" % prec['p']
print "Precision for 'c': %0.2f" % prec['c']
print "Recall for 'p': %0.2f" % rec['p']
print "Recall for 'c': %0.2f" % rec['c']


0.848947490169
Precision for 'p': 0.84
Precision for 'c': 0.87
Recall for 'p': 0.90
Recall for 'c': 0.79

The results are similar. Because of the smaller memory footprint, the sklearn Naive Bayes classifier will be used for the classifications.
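
Since this classifier is reused for the bulk classification below, one option (a sketch, not part of the original workflow; the filename is made up) is to persist it with pickle, just as the results dataframe is pickled later:

#sketch: persist the trained classifier for reuse in other notebooks
import pickle

with open('sknb_classifier.pkl', 'wb') as f:
    pickle.dump(sknb_classifier, f)

#later, in another notebook or session:
with open('sknb_classifier.pkl', 'rb') as f:
    sknb_classifier = pickle.load(f)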


Classification sanity check

To confirm that the classifier is classifying tweets properly, additional users were identified as clearly pro-protester or pro-police. Their tweets were classified, and the results are below.

Note: 'p' = 'pro-protester', 'c' = 'pro-police'


In [37]:
#determine percent of user's tweets that are pro-protester
def testUser(name, classifier):
    df = getTweetsAndLabel(name, '')
    df_ready = prepText(df)
    for i in range(0, len(df_ready)):
        df_ready.loc[i, 'label'] = classifier.classify(ast.literal_eval(df_ready.loc[i]['train data']))
    p = len(df_ready[df_ready['label'] == 'p'])
    c = len(df_ready[df_ready['label'] == 'c'])
    t = p + c
    perc_p = p*1.0 / t
    return perc_p

In [38]:
x = testUser('AntonioFrench', sknb_classifier)
print x
print "Expected: 'p'"


0.826251896813
Expected: 'p'

In [39]:
x = testUser('FredSanford13', sknb_classifier)
print x
print "Expected: 'c'"


0.313455657492
Expected: 'c'

In [40]:
x = testUser('CAC8438', sknb_classifier)
print x
print "Expected: 'c'"


0.194331983806
Expected: 'c'

In [41]:
x = testUser('OpFerguson', sknb_classifier)
print x
print "Expected: 'p'"


0.843704245974
Expected: 'p'

In [42]:
x = testUser('ryanjreilly', sknb_classifier)
print x
print "Expected: 'p'"


0.820408163265
Expected: 'p'

In [43]:
x = testUser('Nettaaaaaaaa', sknb_classifier)
print x
print "Expected: 'p'"


0.911217437533
Expected: 'p'

In [44]:
x = testUser('timjacobwise', sknb_classifier)
print x
print "Expected: 'p'"


0.822222222222
Expected: 'p'

In [45]:
x = testUser('1969WAR1971', sknb_classifier)
print x
print "Expected: 'c'"


0.386363636364
Expected: 'c'

In [46]:
x = testUser('rdipego', sknb_classifier)
print x
print "Expected: 'c'"


0.472222222222
Expected: 'c'

Each user was classified as expected: if 'p' was expected, the fraction of pro-protester tweets should be above 0.5, and if 'c' was expected, it should be below 0.5.


Notes on the data used to train the text classifier:

In other parts of this project, restrictions were applied to the data that have not been applied to the data set used to train the text classifier. These restrictions include only looking at tweets with the hashtag #Ferguson or #ferguson (versus containing the word F/ferguson, even if not as a hashtag), and limiting the data to tweets from August.

For the purpose of text classification, the content of the tweets is the most important factor. The presence of the #Ferguson or #ferguson hashtags is not important, nor is the date of the tweets, and the content of the tweets remains largely the same throughout the time period being analyzed. The high accuracy, precision, and recall of the classifier confirm that these restrictions on the data are not necessary for text classification.


Prepare for classification of the users in the August data set

Each of the users in the August data will have their tweets classified, and the percent of their tweets that are pro-protester will be calculated.


In [72]:
#get users from August data
df_aug = pd.read_csv('/home/data/august_reduced.csv', error_bad_lines = False)

In [76]:
#function to get tweets for each user using August data set
def getTweetsAndLabelAug(name, label):
    df_users_tweets = df_aug[df_aug['user.screen_name'] == name].reset_index(drop = True)
    df_users_tweets['label'] = label
    return df_users_tweets

In [77]:
#determine percent of a user's tweets that are pro-protester, using the August data set
def testUserAug(name, classifier):
    df = getTweetsAndLabelAug(name, '')
    df_ready = prepText(df)
    for i in range(0, len(df_ready)):
        df_ready.loc[i, 'label'] = classifier.classify(ast.literal_eval(df_ready.loc[i]['train data']))
    p = len(df_ready[df_ready['label'] == 'p'])
    c = len(df_ready[df_ready['label'] == 'c'])
    t = p + c
    perc_p = p*1.0 / t
    return perc_p

A few additional users identified during EDA were tested as a sanity check.


In [78]:
x = testUserAug('gerfingerpoken2', sknb_classifier)
print x
print "Expected: 'c'"


0.000337268128162
Expected: 'c'

In [79]:
x = testUserAug('carolynsbuddy', sknb_classifier)
print x
print "Expected: 'p'"


0.847727272727
Expected: 'p'

In [80]:
x = testUserAug('PhilDeCarolis', sknb_classifier)
print x
print "Expected: '?'"


0.721700717835
Expected: '?'

@PhilDeCarolis is interesting. He appears to be a libertarian and a supporter of Ron Paul. Most of his tweets regarding Ferguson are retweets of news headlines, and he doesn't seem to outright support either the protesters or the police.


Classify tweets from August

The tweets of 5,000 random users with ten or more tweets in August will be classified. The sample size of 5,000 was chosen based on time and server memory constraints: the tweets of 5,000 users took about 2-2.5 hours to classify and would occasionally cause the IPython kernel to crash. Larger servers were tried, and 5,000 was the limit that could be classified in a reasonable amount of time without kernel crashes on the largest server available to the team.

IPython's parallel computing would have been useful for this lengthy process. However, when we attempted to scatter the work to the available cores, the process was stopped by memory errors, possibly caused by the amount of data that needed to be pushed to each core. A chunked, checkpointing alternative is sketched after the label_all function below.


In [81]:
#Generate dataframe with users from August
users = pd.DataFrame({'count': df_aug.groupby(["user.screen_name"]).size()}).reset_index()

In [82]:
#restrict to users with ten or more tweets, and randomly select 5,000
users = users[pd.notnull(users['count'])]
users = users[users['count'] >= 10]
rows = np.random.choice(users.index.values, 5000, replace=False)
ran_users = users.ix[rows].reset_index(drop = True)

In [83]:
print len(ran_users)
ran_users.head()


5000
Out[83]:
user.screen_name count
0 ItriSamele 11
1 gohogsgirl 28
2 averrer 17
3 17147578976 15
4 fucking_ninoX_X 12

In [84]:
#function to classify each user's tweets and calculate the percent pro-protester per user
def label_all(df):
    for i in range(0, len(df)):
        x = testUserAug(df.loc[i]['user.screen_name'], sknb_classifier)
        df.loc[i, 'perc_p'] = x
    return df
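
Given the memory constraints described earlier, a chunked variant of label_all (a sketch; the chunk size and checkpoint filename are assumptions) could checkpoint intermediate results so a kernel crash does not lose hours of work:

#sketch: classify users in chunks, checkpointing after each chunk
def label_all_chunked(df, chunk_size = 500):
    results = []
    for start in range(0, len(df), chunk_size):
        chunk = df.iloc[start:start + chunk_size].reset_index(drop = True)
        results.append(label_all(chunk))
        #a crash now loses at most one chunk of work
        pd.concat(results).to_pickle('partial_aug_percents.pkl')
    return pd.concat(results).reset_index(drop = True)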

In [85]:
#classify tweets for users
final = label_all(ran_users)

In [86]:
final.head()


Out[86]:
user.screen_name count perc_p
0 ItriSamele 11 0.727273
1 gohogsgirl 28 0.857143
2 averrer 17 0.823529
3 17147578976 15 1.000000
4 fucking_ninoX_X 12 0.833333

In [87]:
len(final)


Out[87]:
5000

In [93]:
#save dataframe for use in another notebook
final.to_pickle('final_aug_percents.pkl')

In [52]:
#calculate number of users that are pro-protester and number that are pro-police
p = len(final[final['perc_p'] >= 0.5])
c = len(final[final['perc_p'] < 0.5])

print "Percent of users that are pro-protester: {0:.0f}%".format(p*1.0/(p+c)*100)
print "Percent of users that are pro-police: {0:.0f}%".format(c*1.0/(c+p)*100)


Percent of users that are pro-protester: 91%
Percent of users that are pro-police: 9%